Abstract

Background: Short linear protein motifs are attracting increasing attention as functionally independent sites,
typically 3-10 amino acids in length, that are enriched in disordered regions of proteins. Multiple methods have
recently been proposed to discover over-represented motifs within a set of proteins based on simple regular 
expressions. Here, we extend these approaches to profile-based methods, which provide a richer motif 
representation. 

Results: The profile motif discovery method MEME performed relatively poorly for motifs in disordered regions of
proteins. However, when we applied evolutionary weighting to account for redundancy amongst homologous
proteins, and masked out poorly conserved regions of disordered proteins, the performance of MEME was equivalent
to that of regular expression methods. Nevertheless, the two approaches returned different subsets of motifs within
both a benchmark dataset and a more realistic discovery dataset.

Conclusions: Profile-based motif discovery methods complement regular expression based methods. Whilst 
profile-based methods are computationally more intensive, they are likely to discover motifs currently overlooked 
by regular expression methods. 
Background

In protein-protein interaction networks, hub proteins 
are defined as those that interact with a number of other 
proteins, either simultaneously or at different times. 
Whilst domain-domain interactions are important for
stable interactions, rapid low-affinity interactions
mediated by short linear motifs are important for more
transient interactions, for example in signal transduction 
[1,2]. Short linear motifs (SLiMs) are typically 3-10 resi- 
due stretches of a protein sequence, with two or more 
non-wildcard positions that independently mediate a 
range of functions. They may be involved in ligand bin- 
ding, modification, targeting and cleavage [3], all of which 
are important in driving cell signaling [1,4]. Motifs can
act in a coordinated and co-operative manner to exhibit 
functional regulatory complexity within the cell [5]. 
Therefore, the known repertoire of protein modules 
needs to be expanded to include smaller functional 
sites like SLiMs, in addition to well-characterised domain
modules. This will advance understanding of
the fundamental mechanisms that drive protein-protein 
interactions. 

The current (2012) release of the Eukaryotic Linear 
Motif (ELM) database lists 174 experimentally validated 
short protein motifs [3], and the MiniMotif Database 
contains around 5,000 predicted motifs [6]. Databases
dealing with motifs containing sites for post-translational
modifications (PTMs) alone list in the region of
tens of thousands of motifs [7,8]. Surveys have shown
that up to 30% of the human proteome is disordered [9].
Disordered regions are known to be rich in linear motifs 
[10,11]. Given the relatively low number of motifs so far 
identified, it is clear that much work is still to be done 
[10]. Therefore, it is imperative that new tools are deve- 
loped to meet this challenge. 

One approach to discovering short functional se- 
quence motifs is to apply computational tools to find a 
motif that is over-represented among a group of evolu- 
tionarily unrelated sequences that have a related func- 
tion (e.g. they bind a common interacting protein). In 
DNA motif discovery, profile-based methods have been 
very successful in the identification and classification of 
transcription factor and promoter binding sites [12].
However, profile-based methods have not been as widely 
used in the search for protein motifs since the first pub- 
lication of the MEME tool for motif discovery [13]: com- 
putational methods for the discovery of protein motifs 
[14,15] have focused on regular expression over-repre- 
sentation, whilst profile-based discovery programs such 
as MEME [16] have been largely confined to DNA 
analysis. 

Profile-based methods aim to describe the motif in 
terms of the relative frequencies of amino acids at each 
position. The regular expression [DE] allows Aspartic 
and Glutamic acid at that position, and does not define 
the relative frequency with which they are found at that 
position. However, in a profile-based definition it is pos- 
sible to state that Aspartic acid is present 70% of the 
time, and Glutamic acid 30%. This allows a more refined 
definition of the motif. Various methods have been pro- 
posed to define a profile of a linear motif - summarized 
in [17]. Regular expressions are commonly used to at- 
tempt to capture the relevant sequence information 
about a linear motif [14]. Such representations of motifs 
have been favoured by biologists as they are sometimes 
more intuitive than profiles. However, profile-based 
representations may present certain potential advan- 
tages: they can sometimes provide a richer and more 
accurate representation of certain motifs, and can have 
a visual representation (using a "sequence logo" [18]) 
that is often more easily understood than some of the 
more complex regular expressions for highly redundant 
motifs. 
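
The difference between the two representations can be sketched in a few lines of Python. This is an illustration only: the three-position motif, its frequencies and the small floor probability for unlisted residues are assumptions made for the example, not values taken from any dataset in this paper. The regular expression accepts Aspartic and Glutamic acid equally, whereas the profile scores them according to their relative frequencies.

```python
import math
import re

# Hypothetical three-position motif: position 1 is D (70%) or E (30%),
# positions 2 and 3 are fixed. All frequencies are illustrative only.
PROFILE = [
    {"D": 0.7, "E": 0.3},
    {"L": 1.0},
    {"F": 1.0},
]
REGEX = re.compile(r"[DE]LF")  # regular expression form of the same motif
FLOOR = 1e-3                   # assumed probability for residues not in a column


def profile_log_score(peptide):
    """Sum of the log-probabilities of each residue under the profile."""
    return sum(math.log(col.get(aa, FLOOR)) for col, aa in zip(PROFILE, peptide))


for pep in ("DLF", "ELF"):
    # Both peptides match the regular expression, but the profile
    # assigns the D-containing peptide the higher score.
    print(pep, bool(REGEX.fullmatch(pep)), round(profile_log_score(pep), 3))
```

Both peptides match the regular expression, while the profile score separates them; this is the extra information that a profile-based definition carries.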

The shortness of SLiMs makes their discovery difficult, 
because of the resulting difficulty in distinguishing true 
positive from false positive matches. This difficulty is 
further compounded by the degree of variation between 
instances of a linear motif. Thus, careful evaluation in a 
realistic setting of biological discovery is needed to de- 
termine if methods are useful in practice. Many motifs 
lie in disordered regions of proteins, and the motifs are 
often distinguished by greater evolutionary conservation 
among orthologues; this property allows a focus on evo- 
lutionarily conserved residues to increase the chance of 
discovering novel motifs [19]. Focussing on regions con- 
served in orthologues by masking out non-conserved 
disordered residues, and discounting motifs recurrent 
simply amongst homologous rather than unrelated pro- 
teins, have been shown to greatly improve the perform- 
ance of regular expression based methods to both 
identify true instances of known motifs, and to discover 
novel motifs [14,15]. In this paper, we investigate to what 
extent the application of evolutionary weighting and 
masking of protein sequences can improve the perfor- 
mance of profile-based methods to discover short linear 
protein motifs. 



Methods 

A set of protein sequences is used as the query set. 
These sequences, or a subset of them, are believed to 
contain a common motif responsible for a functional 
activity. The motif is likely to be relatively conserved 
between orthologues of the proteins in other related spe- 
cies, in contrast to generally unconserved surrounding 
disordered regions of the protein. We used the relative 
local conservation scoring system described in Davey 
et al. 2009 [19] to mask out unconserved residues of the 
query sequences before submitting the sequences to the 
MEME program to discover over-represented SLiM pro- 
files amongst the sequences [16]. In addition, we used 
the SLiMBuild algorithm from SLiMFinder to produce 
weightings of the relatedness of the query sequences to 
each other [15]. 

Additional masking of the query sequences to remove 
transmembrane regions and domains, taken from Uni- 
Prot annotation [20], was performed in order to increase 
the likelihood of identifying linear motifs in the query 
sequences, by eliminating such sources of high-scoring 
false positives. 

Previous work by Fuxreiter et al. has shown that disor- 
dered regions are enriched for short linear motifs [21]. 
This has been confirmed by a separate analysis of experi- 
mentally validated motifs from the ELM database [22]. 
Both indicate that the residues that comprise the motif 
are likely to have high disorder propensities as compared 
to the flanking regions. A disorder score cut-off of 0.3
balances removing regions of the protein known to be
ordered against reducing the search space excessively.
According to Davey et al. [22], 82% of known motifs have a
disorder score over 0.3, the cut-off used in this analysis.
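
As a minimal sketch of how such a cut-off can be applied, the function below masks every residue whose disorder score falls below 0.3. The function name, the masking character and the example scores are hypothetical; in practice the per-residue scores would come from a disorder predictor, and whether the boundary value itself is retained is an assumption here.

```python
def mask_ordered(sequence, disorder_scores, cutoff=0.3, mask_char="X"):
    """Replace residues whose disorder score is below the cut-off.

    Residues scoring at or above the cut-off are kept (assumed inclusive).
    """
    if len(sequence) != len(disorder_scores):
        raise ValueError("one disorder score is required per residue")
    return "".join(
        aa if score >= cutoff else mask_char
        for aa, score in zip(sequence, disorder_scores)
    )


# Illustrative scores only; not the output of any real predictor.
masked = mask_ordered("MKDELRS", [0.1, 0.2, 0.5, 0.6, 0.7, 0.25, 0.9])
print(masked)
```

Masking with a placeholder character rather than deleting residues keeps the original sequence coordinates, so any motif positions reported later still refer to the unmasked protein.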

Orthology detection and alignment generation 

In order to generate alignments for the proteins in the 
benchmark dataset, we used the series of metazoan 
Ensembl whole genomes downloaded in March 2010 
[23]. We followed the method used in the Gopher orthologous
protein identification and alignment algorithm
described in Edwards et al. [24]. Each query sequence in
the set was searched using BLAST (masking out low
complexity regions) against the metazoan proteome at
an expectation threshold of e = 10^-4. The set of hits from
this search was then used to search against the database
again at a relaxed threshold of e = 10, but without complexity
filtering. Sequences at this stage had to have at least 40%
global similarity to the original query for inclusion. The
most similar sequence for each species was retained for 
inclusion in the alignment. Multiple sequence align- 
ments were then generated using the MUSCLE program 
[25]. 
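
The final selection step (a 40% global similarity filter followed by retaining the most similar sequence per species) can be sketched as follows. This is not the Gopher implementation; the function name, the tuple layout of the hits and the example values are assumptions made for illustration.

```python
def best_hit_per_species(hits, min_similarity=0.40):
    """Keep, for each species, the hit most similar to the query.

    hits: iterable of (species, sequence_id, global_similarity) tuples,
    with similarity expressed as a fraction between 0 and 1.
    """
    best = {}
    for species, seq_id, similarity in hits:
        if similarity < min_similarity:
            continue  # fails the global similarity filter
        if species not in best or similarity > best[species][1]:
            best[species] = (seq_id, similarity)
    return {species: seq_id for species, (seq_id, _) in best.items()}


# Hypothetical hits for one query protein.
hits = [
    ("mouse", "m1", 0.92),
    ("mouse", "m2", 0.55),
    ("zebrafish", "z1", 0.38),
    ("fly", "f1", 0.41),
]
selected = best_hit_per_species(hits)
print(selected)
```

The sequences selected in this way, together with the query, would then be passed to the multiple sequence alignment step.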

We adopted the treatment of evolutionary information 
previously developed and evaluated for SLiM discovery 
[15,26], since the problem of treating evolutionary information
is likely to be very similar for both profile and
regular expression discovery of linear motifs. Improving 
public orthology resources such as those of Ensembl 
[23] may prove useful in future implementations of the 
method, accelerating calculations. 

Relative local conservation 

Disordered regions have different patterns of conserva- 
tion compared to structured regions of protein sequen- 
ces. Therefore, traditional multiple sequence alignments 
are not particularly informative when analyzing the con- 
servation of proteins that include disordered regions, 
since the pattern of conservation is dominated by the 
pattern of order and disorder across the sequence. To 
overcome this, we applied relative local conservation 
(RLC) masking [19], which assesses the conservation of 
disordered residues relative to adjacent disordered resi- 
dues, which we summarise briefly below. 

Residues were marked as belonging to one of two structural
states, ordered or disordered, using the IUPred disorder
prediction program (short setting, threshold 0.3)
[27]. Only residues within 25 residues to either side that
were in the same structural state were compared. Then
the residues in each column of the alignment were 
scored for conservation. 

The RLC is calculated for each residue using a mul- 
tiple sequence alignment of the protein with its ortholo- 
gues. We used Ensembl release 60 metazoan proteomes 
to generate the alignments [23]. Then the residues in 
each column of the alignment were scored for conserva- 
tion. For each residue, this was compared against the 
background mean conservation in a window of 25 resi- 
dues to either side of the residue across the sequence. 
Strongly conserved residues are given a high score with 
more variable residues given a low score [19]. The RLC 
score for the residue is calculated (in the manner of a 
standard Z-score) by subtracting the background mean 
conservation and normalizing by dividing by the stan- 
dard deviation of the window. This results in normalized 
scores that are comparable between residues in different 
protein sequences, irrespective of differences in diver- 
gence patterns. Scores above 0 indicate above average 
relative local conservation. 
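
The Z-score-style calculation described above can be sketched as below. This is a simplified illustration rather than the published RLC implementation [19]: the per-column conservation scores are taken as given, the window is assumed to include the residue itself, and the population standard deviation is used. The function and variable names are hypothetical.

```python
import statistics


def relative_local_conservation(conservation, disordered, flank=25):
    """Z-score of each residue's conservation against its local window.

    conservation: per-residue conservation scores from the alignment.
    disordered:   per-residue booleans (True = predicted disordered).
    Only window residues in the same structural state are compared.
    """
    n = len(conservation)
    rlc = []
    for i in range(n):
        lo, hi = max(0, i - flank), min(n, i + flank + 1)
        window = [
            conservation[j] for j in range(lo, hi)
            if disordered[j] == disordered[i]
        ]
        mean = statistics.fmean(window)
        sd = statistics.pstdev(window)
        # A flat window carries no information; score it as average.
        rlc.append((conservation[i] - mean) / sd if sd > 0 else 0.0)
    return rlc


# Toy example: one conserved residue in a poorly conserved disordered stretch.
scores = relative_local_conservation(
    [0.2, 0.2, 0.9, 0.2, 0.2], [True] * 5, flank=2
)
print([round(s, 2) for s in scores])
```

In the toy example the central residue scores well above 0, and its neighbours score below 0, matching the interpretation that positive values indicate above-average relative local conservation.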

There are a number of alternative search strategies
possible for exploiting information on conservation. One
method would be to use absolute conservation
levels relative to orthologues from defined species. However,
previous work has shown that for discovery of
motifs in predominantly disordered intracellular eukaryotic
protein regions, the relative local conservation
of the motif compared to nearby residues is more
powerful [19]. Accordingly, we adopted this approach,
since the datasets to which we applied this analysis
consist primarily of intracellular motifs. Extracellular motifs,
where there is much less protein disorder, may well
benefit from other approaches.

Evolutionary weighting 

The aim of evolutionary weighting is to appropriately re- 
duce the statistical support for motifs that are over- 
represented because of large-scale sequence homology 
(identity by descent) among a subset of the proteins 
investigated, rather than because of convergent evolution 
to a common motif among unrelated proteins. This is 
achieved by grouping query sequences into unrelated 
protein clusters (UPCs). Proteins within a given search
set were analysed iteratively by BLAST (E-value threshold
of 10^-3) to determine relatedness. Proteins determined
by BLAST to be related were grouped into a
UPC, so that no protein in one cluster is obviously related
to any protein in another cluster. While a similar correction
could be more simply achieved by only choosing
one of the related proteins in the motif search [14], the
approach taken here is favoured because short motifs 
typically evolve faster than domains [28], so that often 
only a subset of the related proteins may possess the 
motif. The weighting of sequences occurs after the as- 
signment of proteins into UPCs. This is to ensure that 
the similarity is calculated based on the full sequence 
and not the masked regions which may be misleading in 
homology assignment. 
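
One way to picture the grouping step is as single-linkage clustering: any two proteins linked by a significant BLAST hit, directly or through intermediates, end up in the same UPC. The sketch below assumes that interpretation and takes the list of related pairs as given; it is not the SLiMBuild code, and the function name and protein identifiers are hypothetical.

```python
def unrelated_protein_clusters(proteins, related_pairs):
    """Group proteins into clusters using pairwise relatedness.

    related_pairs: pairs of proteins judged related (for example, by a
    BLAST hit below the E-value threshold). Clusters are the connected
    components of the resulting graph.
    """
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the cluster representative,
        # shortening the path along the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in related_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), []).append(p)
    return sorted(sorted(members) for members in clusters.values())


upcs = unrelated_protein_clusters(
    ["P1", "P2", "P3", "P4", "P5"],
    [("P1", "P2"), ("P2", "P3")],
)
print(upcs)
```

Here P1 and P3 share a cluster even though they are only related through P2, while P4 and P5 each form a cluster of their own.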

Profile-based discovery of motifs using MEME

MEME uses expectation maximization methods to identify
over-represented motifs in the query set [16]. The
program was presented with the unmasked dataset, the 
masked dataset, the weighted dataset and the masked 
and weighted dataset to judge the impact of each of 
these methods on the performance. The evolutionary 
weighting was calculated using SLiMBuild [24] and the 
masking by using the RLC masking from SLiMFinder 
[15] as described above. 

The datasets were given as the input to MEME run- 
ning with the expectation that there would be zero or 
more motif instances in the query set. The minimum 
length of the motifs was set at 3 and the maximum 
length at 10. Low complexity filters were switched off. 
The profile search is carried out after the sequences are 
filtered for disorder; there is no lower limit on the length 
of disordered sequence considered, except that the motif 
discovery methods require motifs of at least three resi- 
dues in our analysis. 
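
As a rough sketch of how the motif length settings translate into a MEME invocation, the snippet below assembles a command line with the protein alphabet and the minimum and maximum widths given above. The options -protein, -oc, -minw and -maxw are standard MEME options; the file and directory names are hypothetical, and the motif occurrence model and sequence weighting are deliberately left out because the exact options used are not stated here.

```python
import shlex


def meme_command(fasta_path, out_dir, min_width=3, max_width=10):
    """Build (but do not run) a MEME command line for protein motifs."""
    return [
        "meme", fasta_path,
        "-protein",              # protein alphabet
        "-oc", out_dir,          # output directory (overwritten if present)
        "-minw", str(min_width),  # minimum motif width
        "-maxw", str(max_width),  # maximum motif width
    ]


cmd = meme_command("masked_query.fasta", "meme_out")
print(shlex.join(cmd))
```

The resulting list can be passed to a process runner once MEME is installed; building it separately makes the settings easy to inspect and to vary between the unmasked, masked and weighted datasets.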

MEME has an option to weight sequences. To apply 
evolutionary weighting, we incorporated knowledge re- 
garding the distances among sequences generated in the 
UPC building process to weight the sequences appro- 
priately. A minimum spanning tree normalization was 
Table 1 Performance using experimentally validated ELMs (dataset from [14]), searching for protein short linear motifs
in disordered regions of proteins

| ELM name | Number of proteins | Regular expression from ELM database | SLiMFinder with evolutionary weighting and RLC masking (Rank) | MEME default (Rank) | MEME with evolutionary weighting and RLC masking (Rank) |
|---|---|---|---|---|---|
| TRG_ER_KDEL_1 | 12 | [KRH][DENQ]EL | K.{0,2}DEL$ (1) | DEL (35) | DEL (1) |
| LIG_Dynein_DLC8_1 | 4 | [KR].TQT | S..K.TQT (1) | K[AESV]TQFE][PD] (1) | [KV][SAE]TQT (1) |
| LIG_PCNA | 13 | Q..[ILM]..[FHM][FHM] | [IL].S[FH]F (1) | Q.[SR^[IL][DM]SFF (1) | [LI].SFF (3) |
| MOD_SUMO | 29 | [VILAFP]K.[EDNGP] | [FIV]K.E (1) | [IV]K[QE]E[PE] (1) | [IV]KEE (1) |
| LIG_SH3_2 | 9 | P..P.[KR] | P..P.R.{0,1}P (1) | | |
| LIG_CYCLIN_1 | 22 | [RK].L.{0,1}[FYLIVMP] | RR.{0,1}L.{0,1}F (1) | [GE]L[St]R[ED]L.[KE][HLR]L (5) | K[KR][KR] (1) |
| LIG_CtBP | 26 | P.[DEN]L[VAS^ | P[ILM]DL (1) | PLDLS (1) | PLDLS (1) |
| LIG_AP_GAE_1 | 8 | [DE][DES].[F].[DE][LVIMFD] | D.F..F.S..P (1) | DDEF[GS][DE]FQ (1) | [GA]DF (1) |
| LIG_14-3-3_3 | 6 | [RHK][STALV].[ST].[PEDSIF] | S.P.S.T.P (3) | RFS]NSA (65) | - |
| LIG_RB | 25 | [LI].C.[DE] | L.C.E (6) | LVCFE (1) | |
| LIG_Clathr_ClatBox_1 | 15 | L[ILM].[ILMF][DE] | L{1,2}DL{0,2}D (12) | [DE][ST][NSD]I[U][DE][LF] (9) | [LG]L[DG]LD[SG] (1) |
| LIG_14-3-3_1 | 4 | R[FSWY].S.P | RS.S.P (3) | RS[IPRT]S[ALMT]P (29) | S[AI]S[ALE]P (1) |
| LIG_RGD | 15 | RGD | R.D.V (7) | RGD (6) | RGD (3) |
| LIG_HP1_1 | 6 | P.V.[LM] | | | |
| LIG_NRBOX | 9 | L..LL | | [KN]H[AKP]LLS[RN]LL[RQ] (21) | L[KRS][QY]LL (1) |
| MOD_N-GLC_2 | 5 | N.C | | [FHIMY][NS][EANS][CE][VENS][CEHRV]^/AF][MKL^[EAGS][NE] (42) | |
| TRG_lysEnd_APsAcLL | 10 | [DER]...L[LVI] | | | |

All proteins contain at least one experimentally determined motif instance. The regular expression that matches the annotated ELM regular expression is returned 
for each method along with its rank (in brackets). No result is indicated by a dash. Niall Haslam 24 October 2012 09:18. 



used to weight sequences as described in SLiMDisc [26].
This is derived from a matrix of the pairwise similarity
of the sequences. The distances ranged between zero
and 1, where zero indicates no similarity, and 1 indicates
identical sequences. The weights for each sequence (W)
were then estimated from the average distance to all
other sequences (S_d) and the number of other sequences
(N), as follows:

W = (1 - S_d) / N

Consider the sequences A, B, C and D, where A and B are
50% similar to each other, 0% similar to C and 25%
similar to D, and D and C are 0% similar to each other.
A and B have a distance of 0.5; A and C have a distance of
1; and A and D have a distance of 0.75. A will have a
score of 0.56, B will have a score of 0.56, C will have
a score of 1 and D will have a score of 0.69. This ranks
the sequences in order of their similarity to all other
sequences.
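
The pairwise distances in this worked example can be reproduced with a short sketch, taking distance as one minus similarity, as the example values imply. The code assumes B has the same similarities to C and D as A does, and it only derives the distances, the average distance of each sequence to the others and the resulting ordering; it does not attempt to reproduce the normalised scores quoted above.

```python
# Pairwise similarities from the worked example (as fractions).
similarity = {
    frozenset("AB"): 0.50, frozenset("AC"): 0.00, frozenset("AD"): 0.25,
    frozenset("BC"): 0.00, frozenset("BD"): 0.25, frozenset("CD"): 0.00,
}
names = "ABCD"

# Distance taken as 1 - similarity, matching the example values.
distance = {pair: 1.0 - sim for pair, sim in similarity.items()}

# Average distance from each sequence to all the others.
mean_distance = {
    a: sum(distance[frozenset((a, b))] for b in names if b != a) / (len(names) - 1)
    for a in names
}

# Most distinct sequence first.
ranking = sorted(names, key=lambda s: mean_distance[s], reverse=True)
print(distance[frozenset("AD")], mean_distance, ranking)
```

The ordering obtained (C, then D, then A and B together) agrees with the ordering of the scores in the example: the sequence least similar to the others receives the most weight.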

Datasets 

For the purposes of benchmarking the effectiveness of
the evolutionary weighting and relative local conservation
masking method we took datasets from a number of papers
in the field, to facilitate comparisons among methods.
The first dataset was from [14]. It used a gold standard
literature-based dataset from the ELM database [29].

The second dataset is a more realistic test of the nor- 
mal operation of the program using protein-protein 
interaction data downloaded from the Human Protein 
Reference Database (HPRD) [30] taken from [19]. The 
aim of using this dataset was to test the ability of the 
program to uncover motifs in a dataset that is known to 
be noisy. Both datasets are available from the authors of 
the original manuscripts and are included in the supple- 
mentary material. 



Table 2 Summary of performance using experimentally validated ELMs (see Table 1)

| | SLiMFinder | MEME default | MEME with weighting and RLC masking |
|---|---|---|---|
| Number of first hits | 8 | 6 | 9 |
| Number in top 10 | 12 | 9 | 11 |
| Total | 17 | 17 | 17 |
| Percentage first hit | 47% | 35% | 53% |
| Percentage top 10 | 71% | 53% | 65% |



Table 3 Performance for MEME searching for short linear motifs in a realistic motif discovery scenario

| HPRD ID | Hub gene symbol | Number of proteins (with motif) | Complex name | Motif name (from ELM) | Regular expression from ELM database | SLiMFinder with evolutionary weighting and RLC masking | MEME default | MEME with evolutionary weighting and RLC masking |
|---|---|---|---|---|---|---|---|---|
| 150 | GRB2 | 164 (146) | Grb2 | LIG_SH3 | P..P | | - | - |
| 150 | GRB2 | 164 (103) | Grb2 | LIG_SH2_GRB2 | Y.N | Y.N[LMV] (3) | NK[NEK]KNRY[KV][DN]I (2) | KNRY[KPV][ND]ILP (1) |
| 215 | YWHAH | 47 (13) | 14-3-3 Eta | LIG_14-3-3_1 | R[SFYW].S.P | [KR]S.S.P (1) | - | PKIHRSASEP (17) |
| 350 | CLTC | 35 (15) | Clathrin, heavy polypeptide | LIG_Clathr_ClatBox_1 | L[ILM].[ILMF][DE] | LLDL (4) | - | LLDL[EDM][DS][FA]QP (18) |
| 453 | CCNA2 | 25 (23) | Cyclin A2 | LIG_CYCLIN_1 | [RK].L.{0,1}[FYLIVMP] | - | A[CK]R[RN]LFG (7) | SA[CK]R[NR]LFG (8) |
| 607 | FNTA | 10 (2) | Farnesyltransferase alpha subunit | MOD_ASX_betaOH_EGF | C[ADENQ][LIVM]..$ | - | C[DT]IS (30) | C[DT]IS (17) |
| 627 | ITGA5 | 15 (7) | Integrin alpha 5 | LIG_RGD | RGD | | KGDRGDA (25) | RG[DQ] (34) |
| 1456 | PCNA | 65 (13) | Proliferating cell nuclear antigen | LIG_PCNA | Q..[ILM]..[FHM][FHM] | Q..[IL]..FF (1) | TL[YES]SFF (3) | TL[YES]SFF (2) |
| 1574 | RB1 | 110 (28) | Retinoblastoma 1 | LIG_RB | [LI].C.[DE] | | | |
| 3288 | PPARG | 22 (15) | Peroxisome proliferator AR | LIG_NRBOX | L..LL | L.RLL (1) | HKILHRLLQ (4) | [L"WS]HKLVQ[AL][IL] (1) |
| 3334 | DYNLL1 | 52 (5) | Dynein light chain 1 | LIG_Dynein_DLC8_1 | [KR].TQT | - | [M\^S[CY][DS]K[ES]TQTP (95) | KSTQT (10) |
| 3786 | NEDD4 | 28 (17) | NEDD4 | LIG_WW_1 | PP.Y | PP.Y (7) | PPAY (83) | PPPYSSI (2) |
| 3833 | TRAF6 | 22 (19) | TRAF6 | LIG_TRAF6 | .P.E..[FYWHDE]. | | | |
| 4015 | CTBP1 | 26 (14) | C-terminal binding protein | LIG_CtBP | P.[DEN]LVAS^ | D.P[IL]D (6) | - | P[LI]DLS (1) |
| 4946 | CCNA1 | 20 (20) | Cyclin A1 | LIG_CYCLIN_1 | [RK].L.{0,1}[FYLIVMP] | | A[CK]R[RN]LFG (3) | A[CK]R[RN]LFG (1) |
| 5462 | GIPC1 | 26 (7) | GIPC1 | LIG_PDZ_1 | .[ST].[VIL]$ | S.V$ (1) | | |
| 5639 | YWHAG | 206 (28) | 14-3-3 gamma | LIG_14-3-3_1 | R[SFYW].S.P | R.RS.S.S (1) | SRSRSRS[KR]SR (51) | [Sr^SRSRS[RK]SR (30) |
| 8968 | EPS15 | 24 (10) | Eps15 | LIG_EH | NPF | TNPF (1) | TNPF[LS] (3) | TNPF (1) |
| 9045 | UBE2I | 87 (78) | Ubiquitin conjugating enzyme E1 | MOD_SUMO | [VILAFP]K.[EDNGP] | VK.E (2) | M[KM]VKDEY (18) | VEIVYE (7) |
| 9347 | GGA2 | 19 (2) | GGA2 | LIG_AP_GAE_1 | [DE][DES].F.[DE][LVIMFD] | DDF..F..A (1) | D[DL]FG[GDE]F (6) | D[DL]FG[GDE]F (7) |
| 9424 | YAP1 | 15 (8) | YES associated protein | LIG_WW_1 | PP.Y | | [PL][PD]PPY (50) | H[C^D^][LP]PPPY (6) |

The HPRD dataset used, its name and the number of proteins in that dataset along with the ELM known to mediate some interactions with the protein hub are shown. The regular expression returned highest that matches
the annotated ELM and its rank (in brackets) for SLiMFinder and MEME with and without RLC masking and evolutionary weighting are shown. No result is indicated with a dash.
Results

Evolutionary weighting and masking out non-conserved 
regions improves discovery of motifs in disordered 
regions in a standard dataset 

Table 1 shows the results for the ELM dataset of known 
motifs, occurring mainly in disordered regions of proteins. 
This compares MEME's ability to recover known motifs
under different conditions with a regular expression
method, SLiMFinder. Regular expression approximations
of the profiles are shown in Table 1 to facilitate comparison
of the results. Overall, it is clear that the MEME
default is not as effective as SLiMFinder at recovering
known motifs (see Table 2). However, after the inclusion
of both evolutionary weighting and masking out non-conserved
residues, the performance of the two methods is
approximately equivalent. They do not give identical
results, however: SLiMFinder returns 3 motifs that MEME with
weighting and masking fails to identify in the top ten
motifs, namely SH3, 14-3-3_3 and RB; the latter motif was
identified by the default MEME program but lost in
the modified version. Conversely, MEME with weighting
and masking does identify the NRBOX motif that would 
not have been discovered by either SLiMFinder or MEME 
default, indicating that the approach may be complemen- 
tary to regular expression searching. SLiMFinder returns 
41% of the true positives whilst MEME with evolutionary 
weighting and masking returns 72%. The false positive 
rate for SLiMFinder is 59% with 28% for MEME. 

Profile methods complement regular expression methods 
with a realistic biological discovery test dataset 

Table 3 shows the results for the HPRD dataset, which 
represents a more realistic motif discovery example, 
since not all proteins contain a validated instance of 
the motif. Thus, this represents a typical question that 
would be asked by experimentalists trying to discover 
motifs in their protein-protein interaction data. In the
case of the HPRD SH2_GRB2 example, both weighting and
masking were required to identify the motif, since analysis
without either one or without both failed to identify the
motif (Additional file 1: Table S1). A summary of the
findings of these HPRD search results (Table 4) clearly
indicates the benefits of applying both weighting and
masking. MEME with weighting and masking does not
perform quite as well as SLiMFinder in terms of identifying the true
motif at the top of the list (Table 4); but it is equivalent
in terms of identifying the true motif within the top 10, 
identifying 4 motifs that SLiMFinder failed to identify in 
the top 10 (Table 3). SLiMFinder returns 21% of the true 
positives whilst MEME with evolutionary weighting and 
masking returns 9%. Again, this indicates that the two 
methods are complementary. 

Thus, the addition of the evolutionary weighting and 
RLC masking is able to increase the ability of MEME to 



Table 4 Summary of the results in a realistic motif
discovery scenario (see Table 3)

| | SLiMFinder | MEME default | MEME with weighting and RLC masking |
|---|---|---|---|
| Number returned | 12 | 14 | 17 |
| Number of first hits | 7 | 0 | 5 |
| Total | 21 | 21 | 21 |
| Number in top ten | 12 | 6 | 12 |
| Percentage returned | 76% | 67% | 81% |
| Percentage top 10 | 57% | 29% | 57% |
| Percentage top hit | 33% | 0% | 24% |



identify the correct motif in the top 10 results returned. 
This indicates the likely benefits of including evolutionary 
weighting and RLC masking in de novo motif discovery, 
particularly for motifs lying in structurally disordered 
regions, which are strongly represented within both test 
datasets. 

It might be expected that the motifs returned by 
MEME are richer, containing more information, since 
the profile representation has the potential to encode 
more information. Table 5 shows the SeqLogo rep- 
resentation of a sample of the motifs. Additional file 2: 
Table S2 indicates that for the ELM dataset, of motifs 
returned by both methods, for eight of them, MEME 
was more informative (motif less likely to occur by 
chance), one was equal, and for two SLiMFinder was 
more informative. For the more biologically realistic 
HPRD dataset (Additional file 3: Table S3), all 17 com- 
parable motifs were more informative for MEME. This 
feature of MEME, that it tends in a realistic setting to 
return more informative motifs, may be valuable, since 
it may be detecting true subtleties that are missed by 
regular expression searching. However, we cannot ex- 
clude the possibility that it is simply over-fitting to the 
available data. 

In the example of Lig_Dynein the profile accurately 
captures the lack of flexibility in the final three positions. 
These three residues are all well conserved and contribute
hydrogen bonds to the interaction. The profile allows
Serine, Glutamic Acid and Alanine at position two in the
motif, and we examined the available structures using the
LIGPLOT software [31]. No structures were
available for the case of Valine at position 1. In the case
of Serine (PDB id: 1F95) and Alanine (PDB id: 3E2B) at
position two, there is no evidence of these residues contributing
hydrogen bonds to the binding [32,33]. In the
case of Glutamic Acid (PDB id: 2PG1) it does contribute
a hydrogen bond to the interaction [34]. Thus, motifs
Table 5 Sample of Sequence Logos from the MEME output from Table 1

| ELM name | MEME defined regular expression |
|---|---|
| Lig_Dynein | [KV][SAE]TQT |
| Lig_PCNA | [LI].SFF |
| Lig_Cyclin_1 | K[KR][KR] |
| Lig_CtBP | PLDLS |
| Lig_KDEL | DEL |

[SeqLogo from MEME: one sequence logo image per motif; images not recoverable from the extracted text]
with a charged rather than a small residue at position
two may have a distinct mode of binding. 

The approaches described here will be useful for pro- 
teomics experiments where the user expects that asso- 
ciations and interactions are mediated by short linear 
motifs. Accordingly, we have made scripts available to 
facilitate calculation of the weights for submission to 
MEME, as well as for the masking out of non-conserved 
residues at http://bioware.ucd.ie/meme.html. Developers 
interested in contributing to the further development of 
this freely available software are invited to apply for ac- 
cess to the subversion software repository. Applications 
of this approach should cite this paper, but also MEME 
[16]. Other tools employed such as IUPred [27], BLAST 
[35], Muscle [25], ClustalW [36] should be acknowledged 
as appropriate. 

Discussion 

In the search for linear protein motifs the recent focus 
has centred on regular expressions. Whilst the methods 
that have transferred from DNA transcription factor 
searching such as MEME and NestedMica [37] have 
been applied to motif searching in proteins, profile-based
methods have not typically been applied in SLiM
databases such as ELM, phospho.ELM and MiniMotif. 
Regular expression based definitions have been preferred 
for a number of reasons, from ease of use to the fact that 
they can include subjective annotations from expert 
curators of the databases. However, in the move to more 
automated methods of protein linear motif discovery, 
there are a number of advantages to incorporating 
profile-based definitions, as these may increase the in- 
formativeness of the motifs under certain circumstances. 

It is of interest to speculate on possible reasons why 
the MEME-based approach adopted here may give dif- 
ferent results to regular expression approaches such as 
SLiMFinder. One possible reason is that certain motifs 
may evolve in a way that is best described by a regular 
expression, and therefore regular expressions will have 
more power to detect them, whilst other motifs evolve 
in a way that is more easily captured by a profile repre- 
sentation, with requirements for a specific subset of 
residues at certain positions that do not match the com- 
mon ambiguity sets implemented in regular expression 
searching. We would have anticipated that profile repre- 
sentations will become more powerful for motifs with 
many occurrences, since the profile definitions are more 
approximate when there are only a few sequences, but 
we find no clear suggestions to date that this is the case 
(Tables 1 and 2). The two search strategies differ in an- 
other subtle way: with the MEME approach, the se- 
quence weighting occurs before motif discovery, whereas 
with SLiMFinder motifs are selected within the entire 
protein dataset and then ranked afterwards, on the basis
of statistical support. It is possible that this may favour
SLiMFinder whenever a motif is found in only a small 
subset of the related proteins. In the 8 test datasets 
where the motif is found in less than 30% of the interac- 
ting proteins, SLiMFinder finds the correct motif more 
often than MEME, 5 times compared to 3 times. 

The discovery of linear motifs in interaction sets and 
proteomics experiments is only one step in the process of 
determining the functionality of proteins or sets of pro- 
teins in an interaction network. Profile-based methods will 
be useful in the process of searching for further instances 
of motifs identified in one experiment. Given the success 
of profile-based searching methods in complementing se- 
quence based searching in domain recognition [38,39],
we anticipate that profile-based approaches should also 
take their place alongside regular expression methods in 
SLiM identification (e.g. SLiMSearch [40]). 

Additional files 



Additional file 1: Table S1. Summary of Performance using a realistic
motif discovery scenario, searching for protein short linear motifs in 
disordered regions of proteins with MEME. 

Additional file 2: Table S2. Probability that a motif occurs by chance as 
an indicator of information content of the motif representation. Pmotif 
scores [15] relate to results from Table 1. 

Additional file 3: Table S3. Probability that a motif occurs by chance as 
an indicator of information content of the motif representation. Pmotif 
scores [15] relate to results from Table 3.